119 research outputs found

    A Bootstrapping architecture for time expression recognition in unlabelled corpora via syntactic-semantic patterns

    In this paper we describe a semi-supervised approach to the extraction of time expression mentions in large unlabelled corpora based on bootstrapping. Bootstrapping techniques rely on a relatively small number of initial human-supplied examples (termed “seeds”) of the type of entity or concept to be learned, in order to capture an initial set of patterns or rules from the unlabelled text that extract the supplied data. In turn, the learned patterns are employed to find new potential examples, and the process is repeated to grow the set of patterns and (optionally) the set of examples. To prevent the learned pattern set from producing spurious results, it becomes essential to implement a ranking and selection procedure that filters out “bad” patterns and, depending on the case, new candidate examples. Therefore, the type of patterns employed (knowledge representation) as well as the ranking and selection procedure are paramount to the quality of the results. We present a complete bootstrapping algorithm for the recognition of time expressions, with special emphasis on the type of patterns used (a combination of semantic and morpho-syntactic elements) and the ranking and selection criteria. Bootstrapping techniques have been previously employed with limited success for several NLP problems, both of recognition and classification, but their application to time expression recognition is, to the best of our knowledge, novel. As of this writing, the described architecture is in the final stages of implementation, with experimentation and evaluation already underway.
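    The seed/pattern loop the abstract describes can be sketched in a few lines. This is a deliberately minimal illustration, not the authors' architecture: patterns are reduced to one word of left and right context, mentions are single tokens, and the ranking step is a simple support threshold rather than the paper's ranking criteria.

```python
import re

def contexts(corpus, mention):
    """Learn patterns from a known mention: here a pattern is just the
    (left word, right word) context pair around the mention."""
    pats = set()
    for sent in corpus:
        for m in re.finditer(r'(\w+) ' + re.escape(mention) + r' (\w+)', sent):
            pats.add((m.group(1), m.group(2)))
    return pats

def bootstrap(corpus, seeds, rounds=2, min_support=2):
    examples, patterns = set(seeds), set()
    for _ in range(rounds):
        # 1) learn candidate patterns from the current example set
        cand = {}
        for ex in examples:
            for p in contexts(corpus, ex):
                cand.setdefault(p, set()).add(ex)
        # 2) rank/select: keep patterns supported by several distinct examples
        patterns |= {p for p, exs in cand.items() if len(exs) >= min_support}
        # 3) apply selected patterns to harvest new candidate examples
        for left, right in patterns:
            for sent in corpus:
                for m in re.finditer(rf'{left} (\w+) {right}', sent):
                    examples.add(m.group(1))
    return examples, patterns
```

    On a toy corpus with seeds {"monday", "friday"}, the support threshold selects the pattern ("on", "at"), which then harvests "saturday" but also the spurious "gravel" from "walked on gravel at the park", illustrating exactly why the ranking and selection of candidates is paramount.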

    Building a Spanish/Catalan health records corpus with very sparse protected information labelled

    Electronic Health Records (EHR) are an important resource for the research and study of diseases, treatments and symptoms. However, due to data protection laws, information that could potentially compromise privacy must be anonymized before making use of them. Thus, the identification of these pieces of information is mandatory. This identification is usually performed by linguistic models built from EHR corpora in which Protected Health Information (PHI) has been previously annotated. Nevertheless, two main drawbacks can occur. First, the annotated corpora required to build the models for a particular language may not exist. Second, unannotated corpora might exist for that language but contain very few words related to PHI mentions (i.e., a very sparse population). In this situation, the process of manually annotating EHRs becomes extremely hard and costly, as PHI occurs in very few EHRs. This paper proposes an iterative method for building a corpus with labelled PHI from a large unlabelled corpus with a very sparse population of target PHI. The method makes use of manually defined rules specified in the form of Augmented Transition Networks, and tries to minimize the search for EHRs containing PHI, thus minimizing the cost of manually annotating very sparse EHR corpora. We use the method with primary care EHRs written in Spanish and Catalan, although it is language-independent and could be applied to EHRs written in other languages. Direct and indirect evaluations performed on the resulting labelled corpus show the appropriateness of our method.
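    One iteration of the idea can be sketched as follows. The rules and record formats here are hypothetical, and simple regexes stand in for the paper's Augmented Transition Networks; the point is only the filtering step, which keeps annotators from reviewing the (vast) majority of PHI-free records.

```python
import re

# Hypothetical hand-written rules; plain regexes stand in for the
# Augmented Transition Networks (ATNs) used in the paper.
RULES = {
    "PHONE": re.compile(r"\b\d{3}[- ]\d{3}[- ]\d{3}\b"),
    "DNI":   re.compile(r"\b\d{8}[A-Z]\b"),  # Spanish national ID shape
}

def harvest(unlabelled):
    """One iteration: keep only the records that trigger some rule,
    pre-annotated with the matched spans, and set the rest aside."""
    labelled, remainder = [], []
    for rec in unlabelled:
        spans = [(tag, m.span()) for tag, rx in RULES.items()
                 for m in rx.finditer(rec)]
        (labelled if spans else remainder).append((rec, spans))
    return labelled, remainder
```

    Iterating would then refine the rule set on the newly labelled records and re-run the harvest over the remainder.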

    Talp-UPC at eHealth-KD challenge 2019: A joint model with contextual embeddings for clinical information extraction

    Most eHealth entity recognition and relation extraction models tackle the identification of entities and relations with independent specialized models. In this article, we show how a single combined model can exploit the correlation between these two tasks to improve the evaluation score of both, while reducing training and execution time. Our model uses both traditional part-of-speech tagging and dependency parsing of the documents and state-of-the-art pre-trained contextual embeddings as input features. Furthermore, Long Short-Term Memory units are used to model close relationships between words, while convolution filters are applied for farther dependencies. Our model achieved the highest score in all three tasks of IberLEF 2019's eHealth-KD competition [7]. This advantage was especially pronounced in the relation extraction tasks, in which it outperformed the second-best model by a margin of 9.3% in F1 score.

    Topic modeling for entity linking using keyphrase

    This paper proposes an Entity Linking system that applies a topic modeling ranking. We apply a novel approach in order to provide new relevant elements to the model. These elements are keyphrases related to the queries and gathered from a huge Wikipedia-based knowledge resource.

    Unsupervised ensemble minority clustering

    Cluster analysis lies at the core of most unsupervised learning tasks. However, the majority of clustering algorithms depend on the all-in assumption, in which all objects belong to some cluster, and perform poorly on minority clustering tasks, in which a small fraction of signal data stands against a majority of noise. The approaches proposed so far for minority clustering are supervised: they require the number and distribution of the foreground and background clusters. In supervised learning and all-in clustering, combination methods have been successfully applied to obtain distribution-free learners, even from the output of weak individual algorithms. In this report, we present a novel ensemble minority clustering algorithm, Ewocs, suitable for weak clustering combination, and provide a theoretical proof of its properties under a loose set of constraints. The validity of the assumptions used in the proof is empirically assessed using a collection of synthetic datasets.
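    The weak-combination intuition can be rendered in a simplistic form (this is not the Ewocs algorithm, just a sketch of the co-association idea): run many deliberately weak clusterings, count how often pairs of objects land in the same cluster, and score each object by how strongly it is co-clustered with its closest allies, so that a small, consistently grouped signal stands out against the noise.

```python
import random

def weak_clustering(xs, n_cuts=10, lo=0.0, hi=100.0):
    """A deliberately weak 1-D clusterer: bin points by random cut points."""
    cuts = sorted(random.uniform(lo, hi) for _ in range(n_cuts))
    return [sum(c < x for c in cuts) for x in xs]  # bin index per point

def minority_scores(xs, n_runs=200, k=5):
    """Score each object by its average co-association with its k most
    frequently co-clustered peers over many weak clustering runs."""
    n = len(xs)
    co = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        labels = weak_clustering(xs)
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    co[i][j] += 1
                    co[j][i] += 1
    # tightly packed signal points are almost never split apart, so their
    # top-k co-association stays near 1; scattered noise scores lower
    return [sum(sorted(co[i], reverse=True)[:k]) / (k * n_runs)
            for i in range(n)]
```

    On synthetic 1-D data with 50 uniform noise points and 10 signal points packed near one value, the signal's mean score clearly exceeds the noise's, and a threshold on the score recovers the minority cluster without being told its size or distribution.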

    Non-parametric document clustering by ensemble methods

    The biases of individual algorithms for non-parametric document clustering can lead to non-optimal solutions. Ensemble clustering methods may overcome this limitation, but have not been applied to document collections. This paper presents a comparison of strategies for non-parametric document ensemble clustering.

    Spoken document retrieval based on approximated sequence alignment

    This paper presents a new approach to spoken document information retrieval for spontaneous speech corpora. The classical approach to this problem is the use of an automatic speech recognizer (ASR) combined with standard information retrieval techniques. However, ASRs tend to produce transcripts of spontaneous speech with a significant word error rate, which is a drawback for standard retrieval techniques. To overcome this limitation, our method is based on an approximated sequence alignment algorithm to search for “sounds like” sequences. Our approach does not depend on extra information from the ASR and outperforms the precision of state-of-the-art techniques by up to 7 points in our experiments.
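    The “sounds like” idea can be illustrated with a toy approximate matcher. This sketch uses character-level similarity (via the standard library's `difflib.SequenceMatcher`) over word windows as a stand-in for the paper's approximated sequence alignment; the threshold and tokenization are assumptions.

```python
from difflib import SequenceMatcher

def soundslike_search(transcript, query, threshold=0.7):
    """Slide a query-sized window over the (noisy) transcript and keep
    windows whose similarity to the query exceeds a threshold, so that
    ASR misrecognitions such as 'speach' still match 'speech'."""
    toks = transcript.split()
    w = len(query.split())
    hits = []
    for i in range(len(toks) - w + 1):
        window = " ".join(toks[i:i + w])
        score = SequenceMatcher(None, window, query).ratio()
        if score >= threshold:
            hits.append((i, window, round(score, 2)))
    return hits
```

    An exact-match search over the same noisy transcript would miss the query entirely, which is precisely the limitation the approximated alignment is meant to overcome.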

    Negation cues detection using CRF on Spanish product review texts

    This article describes the negation cue detection approach designed and built by UPC's team participating in the NEGES 2018 Workshop on Negation in Spanish. The approach uses supervised CRFs as the base for training the model, with several features engineered to tackle the task of negation cue detection in Spanish. The result is evaluated by means of precision, recall, and F1 score in order to measure the performance of the approach. The approach ranked 1st in the official testing results, with an average precision around 91%, average recall around 82%, and average F1 score around 86%.
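    Feature engineering for such a CRF typically produces one feature dictionary per token, of the kind a CRF toolkit (e.g. sklearn-crfsuite) consumes for training and tagging. The concrete features below are illustrative assumptions, not the team's engineered feature set.

```python
def token_features(tokens, i):
    """Build a per-token feature dict for CRF training: surface form,
    shape, suffix, a small Spanish negation lexicon, and local context."""
    w = tokens[i]
    feats = {
        "lower": w.lower(),
        "is_title": w.istitle(),
        "suffix3": w.lower()[-3:],
        # tiny illustrative lexicon of Spanish negation cues
        "in_neg_lexicon": w.lower() in {"no", "ni", "nunca", "sin", "tampoco"},
        "BOS": i == 0,
        "EOS": i == len(tokens) - 1,
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats
```

    The CRF then learns, from annotated review sentences, which feature combinations mark a token as a negation cue, letting context features disambiguate lexicon words used non-negatively.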

    Coreference Resolution in FreeLing 4.0

    This paper presents the integration of RelaxCor into FreeLing. RelaxCor is a coreference resolution system based on constraint satisfaction that ranked second in the CoNLL-2011 shared task. FreeLing is an open-source library for NLP with more than fifteen years of existence and a widespread user community. We present the difficulties found in porting RelaxCor from a shared-task scenario to a production environment, as well as the solutions devised. We present two strategies for this integration and a rough evaluation of the obtained results.

    Summarizing a multimodal set of documents in a Smart Room

    This article reports an intrinsic automatic summarization evaluation in the scientific lecture domain. The lecture takes place in a Smart Room that has access to different types of documents produced from different media. An evaluation framework is presented to analyze the performance of systems producing summaries that answer a user need. Several ROUGE metrics are used, and a manual content responsiveness evaluation was carried out in order to analyze the performance of the evaluated approaches. Various multilingual summarization approaches are analyzed, showing that the use of different types of documents outperforms the use of transcripts. In fact, not using any part of the spontaneous speech transcription in the summary improves the performance of automatic summaries. Moreover, the use of semantic information represented in the different textual documents coming from different media helps to improve summary quality.